Please refer to the video lecture for the full information on this topic and code set-up.
In this lecture we show an example of how you can use R for Natural Language Processing (NLP). There is no assignment or project attached to this topic because NLP tasks vary so widely. Note that R is often not the best choice for NLP; other languages, such as Python, are usually a stronger choice because of their library support.
If you are looking for a supplemental assignment, read through the great walkthrough on NLP written here.
We'll need the following libraries: tm, twitteR, wordcloud, RColorBrewer, e1071, and class.
You can install them with this code (uncomment it first):
#install.packages('tm',repos='http://cran.us.r-project.org')
#install.packages('twitteR',repos='http://cran.us.r-project.org')
#install.packages('wordcloud',repos='http://cran.us.r-project.org')
#install.packages('RColorBrewer',repos='http://cran.us.r-project.org')
#install.packages('e1071',repos='http://cran.us.r-project.org')
#install.packages('class',repos='http://cran.us.r-project.org')
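If some of these packages are already installed, there is no need to install all of them again. Here is a minimal sketch that lists only the missing ones, assuming the same CRAN mirror as above; `missing_packages` is a hypothetical helper name, not part of any library:

```r
# Packages used in this lecture
pkgs <- c("tm", "twitteR", "wordcloud", "RColorBrewer", "e1071", "class")

# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = 'http://cran.us.r-project.org')
# }
```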
This project requires you to create a Twitter account and a Twitter application if you want to follow along. Let's outline the steps to do this:
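Then use the application's keys and tokens with the twitteR library:
# older twitteR releases used getTwitterOAuth(); current releases use setup_twitter_oauth()
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)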
Now let's review a few key regular expression and string functions we've touched upon earlier:
Return the index location of pattern matches
args(grep)
grep('A', c('A','B','C','D','A'))
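grep returns positions by default, but a few closely related base R options are often more convenient when filtering text. A small sketch with made-up sample strings:

```r
# Made-up sample data
tweets <- c("Great soccer match", "I love basketball", "SOCCER tonight!")

# Positions of the elements that match (case-sensitive by default)
idx <- grep("soccer", tweets)

# value = TRUE returns the matching elements themselves
hits <- grep("soccer", tweets, value = TRUE)

# ignore.case = TRUE makes the match case-insensitive
idx_any_case <- grep("soccer", tweets, ignore.case = TRUE)

# grepl returns one TRUE/FALSE per element, which is handy for subsetting
is_match <- grepl("soccer", tweets, ignore.case = TRUE)

print(idx)
print(hits)
print(idx_any_case)
print(tweets[is_match])
```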
Return the length of a string (spaces count as characters)
args(nchar)
nchar('helloworld')
nchar('hello world')
Replace every match of a pattern
args(gsub)
gsub('pattern','replacement','hello have you seen the pattern here?')
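The pattern argument is a regular expression, not just a fixed word, and sub is the single-replacement counterpart of gsub. A short sketch on a made-up tweet shows both points:

```r
# sub replaces only the first match, gsub replaces every match
first_only  <- sub("a", "A", "banana")
all_matches <- gsub("a", "A", "banana")

# Made-up sample tweet
tweet <- "Great goal! http://t.co/abc123 #soccer"

# Drop anything that starts with http, up to the next whitespace
no_url <- gsub("http[^[:space:]]*", "", tweet)

# Collapse the run of spaces left behind into a single space
clean <- gsub("[[:space:]]+", " ", no_url)

print(first_only)
print(all_matches)
print(clean)
```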
Concatenate strings, separated by sep
print(paste('A','B','C',sep='...'))
#help(paste)
Return the substring in the given character range start:stop of the given string
substr('abcdefg',start=2,stop = 5)
Split a string into a list of substrings at each occurrence of split
strsplit('2016-01-23',split='-')
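Because strsplit returns a list (one element per input string), you usually need `[[1]]` or an apply-style call to get at the pieces. The split argument is also treated as a regular expression unless fixed = TRUE is set. A small sketch:

```r
# Take the first (and only) list element to get a character vector
parts <- strsplit('2016-01-23', split = '-')[[1]]
year  <- parts[1]

# Put the pieces back together in a different order with paste
reordered <- paste(rev(parts), collapse = '/')

# With several input strings, there is one list element per string
dates <- c('2016-01-23', '2017-12-05')
years <- vapply(strsplit(dates, split = '-'), function(p) p[1], character(1))

# '.' is a regex wildcard, so use fixed = TRUE to split on a literal dot
dotted <- strsplit('a.b.c', split = '.', fixed = TRUE)[[1]]

print(parts)
print(reordered)
print(years)
print(dotted)
```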
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
We'll use the twitteR library to mine data from Twitter. First, connect by setting up your authorization keys and tokens.
setup_twitter_oauth(consumer_key, consumer_secret, access_token=NULL, access_secret=NULL)
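It is usually safer to keep keys and tokens out of the script itself. One possible approach, sketched below, reads them from environment variables. The helper `read_twitter_credentials` and the variable names are assumptions made for illustration; they are not part of twitteR.

```r
# Hypothetical helper: read the four credentials from environment variables.
# The variable names are assumptions, use whatever names you prefer.
read_twitter_credentials <- function() {
  vars <- c(consumer_key    = "TWITTER_CONSUMER_KEY",
            consumer_secret = "TWITTER_CONSUMER_SECRET",
            access_token    = "TWITTER_ACCESS_TOKEN",
            access_secret   = "TWITTER_ACCESS_SECRET")
  creds <- Sys.getenv(vars, unset = "")
  names(creds) <- names(vars)

  # Fail early with a clear message if anything is missing
  missing <- names(creds)[creds == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(creds)
}

# Usage, assuming the variables are set and twitteR is loaded:
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                     creds$access_token, creds$access_secret)
```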
We will search Twitter for the term 'soccer' and extract the text of each tweet
soccer.tweets <- searchTwitter("soccer", n=2000, lang="en")
soccer.text <- sapply(soccer.tweets, function(x) x$getText())
We'll remove emoticons and create a corpus
soccer.text <- iconv(soccer.text, 'UTF-8', 'ASCII', sub = '') # drop characters with no ASCII equivalent, such as emoticons
soccer.corpus <- Corpus(VectorSource(soccer.text)) # create a corpus
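When converting to ASCII, note that iconv returns NA for any whole element that contains a character it cannot convert, unless a sub value is supplied. A quick sketch with made-up strings shows the difference:

```r
# Made-up sample data: the first string contains a non-ASCII symbol
raw <- c("What a goal \u26bd!", "plain ascii tweet")

# Without sub, the whole first element becomes NA
strict <- iconv(raw, "UTF-8", "ASCII")

# With sub = "", only the non-convertible bytes are dropped
stripped <- iconv(raw, "UTF-8", "ASCII", sub = "")

print(strict)
print(stripped)
```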
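We'll build a term-document matrix with the TermDocumentMatrix function, applying some transformations through its control list (removing punctuation, stop words, and numbers, and converting to lowercase)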
term.doc.matrix <- TermDocumentMatrix(soccer.corpus,
                                      control = list(removePunctuation = TRUE,
                                                     stopwords = c("soccer", "http", stopwords("english")),
                                                     removeNumbers = TRUE,
                                                     tolower = TRUE))
head(term.doc.matrix)
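To make the idea of a term-document matrix concrete, here is a tiny hand-built version in base R: one row per term, one column per document, and each cell holding a count. This is only an illustration on made-up sentences, not how tm builds its matrix internally.

```r
# Made-up sample documents
docs <- c("soccer is fun", "fun fun goal")

# Lowercase each document and split it into words
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# The vocabulary: every distinct term, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# Count how often each term occurs in each document
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))
print(tdm)

# Row sums give the total frequency of each term across all documents
toy.freqs <- sort(rowSums(tdm), decreasing = TRUE)
print(toy.freqs)
```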
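# convert to a regular matrix (terms in rows, tweets in columns)
term.doc.matrix <- as.matrix(term.doc.matrix)
# total count of each word across all tweets, most frequent first
word.freqs <- sort(rowSums(term.doc.matrix), decreasing=TRUE)
# data frame of words and their frequencies
dm <- data.frame(word=names(word.freqs), freq=word.freqs)
# draw the word cloud, plotting words in decreasing order of frequency
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))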
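With 2000 tweets the cloud can get crowded. The wordcloud function accepts min.freq and max.words arguments to limit what is drawn, and scale to control the font size range. A sketch on made-up frequencies, which only plots if both packages are installed:

```r
# Made-up word frequencies for illustration
toy <- data.frame(word = c("goal", "match", "team", "league", "coach"),
                  freq = c(12, 9, 7, 2, 1),
                  stringsAsFactors = FALSE)

# The words that pass a minimum-frequency threshold of 2
frequent <- toy[toy$freq >= 2, ]
print(frequent)

# Draw the cloud only when the packages are available
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  wordcloud::wordcloud(toy$word, toy$freq,
                       min.freq = 2,        # skip words seen fewer than 2 times
                       max.words = 100,     # cap on the number of words drawn
                       scale = c(3, 0.5),   # largest and smallest font sizes
                       random.order = FALSE,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```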